Hardware Support to Reduce Overhead in Fine-Grain Media Codes

Authors

Deependra Talla

Lizy K. John

Abstract

The growing importance of media and media-like codes has caused general-purpose processors to incorporate SIMD-like extensions, such as MMX, SSE, and AltiVec. While these media extensions do improve performance, significant parallelism in these codes remains unexploited. In this paper, we propose and evaluate a programmable loop engine (PLE for short) that executes media codes and kernels efficiently by moving most of the overhead associated with the media programs into hardware. We study a range of media codes to determine which features and characteristics of these programs are sufficiently frequent to merit implementation in hardware. The common classes of operations that we find are multiple nested loop control, address generation, data transformations, and streaming memory operations. We quantify the fraction of instructions that can be eliminated for our benchmarks, showing that it is over 65% on average. We evaluate a design of one PLE, showing how it can be integrated cleanly with a general-purpose pipeline, defining an instruction set interface, and quantifying its area and power overhead with a VHDL implementation. We find that the PLE requires a mere 10% of the area required by the MMX and SSE extensions. Furthermore, the PLE improves performance of a set of media kernels, over that of a 4-way SIMD processor, by more than 7X on average, and accelerates a set of five media applications by an average of 54%. For all kernels and 3 of 5 applications, a 4-way processor with the PLE outperforms an 8-way SIMD processor without the PLE. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor.
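To make the overhead classes named in the abstract concrete, the following is a minimal sketch that tallies operations by class for a toy saturating-add kernel over two 8-bit images. The kernel, the function names (`saturating_add_kernel`, `overhead_fraction`), and the one-count-per-operation accounting are all assumptions made for illustration; treating the saturating clamp as a "data transformation" is likewise an assumption. None of this is the paper's benchmark suite or measurement methodology.

```python
from collections import Counter


def saturating_add_kernel(src_a, src_b, width, height):
    """Add two row-major 8-bit images with saturation.

    Returns the result image and a Counter that tallies, per class,
    how many operations the scalar loop nest performed (toy accounting).
    """
    ops = Counter()
    dst = [0] * (width * height)
    for y in range(height):
        ops["loop_control"] += 1            # outer-loop step (increment, compare, branch)
        for x in range(width):
            ops["loop_control"] += 1        # inner-loop step
            idx = y * width + x
            ops["address_generation"] += 1  # row-major index computation
            a = src_a[idx]
            b = src_b[idx]
            ops["memory_streaming"] += 2    # two sequential loads
            total = a + b
            ops["core_compute"] += 1        # the only "useful" arithmetic
            dst[idx] = min(total, 255)
            ops["data_transformation"] += 1 # clamp back to the 8-bit range
            ops["memory_streaming"] += 1    # one sequential store
    return dst, ops


def overhead_fraction(ops):
    """Fraction of tallied operations that are not core computation."""
    total = sum(ops.values())
    return (total - ops["core_compute"]) / total


if __name__ == "__main__":
    a = [250, 1, 2, 3, 4, 5, 6, 7]
    b = [10, 1, 1, 1, 1, 1, 1, 1]
    result, counts = saturating_add_kernel(a, b, width=4, height=2)
    print(result)
    print(dict(counts))
    print(f"overhead fraction: {overhead_fraction(counts):.2f}")
```

In this toy tally only the additions count as useful work; the remaining classes correspond to the kinds of operations the abstract says the PLE moves into hardware. The resulting fraction depends entirely on the accounting chosen here and should not be read as a reproduction of the paper's figures.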


Similar resources

Optimized RTL Code Generation from Coarse-Grain Dataflow Specification for Fast HW/SW Cosynthesis

This paper presents a new methodology of automatic RTL code generation from coarse-grain dataflow specification for fast HW/SW cosynthesis. A node in a coarse-grain dataflow specification represents a functional block such as FIR and DCT, and an arc may deliver multiple data samples per block invocation, which complicates the problem and distinguishes it from the behavioral synthesis problem. Given ...


Comparison of Context Switching Methods for Fine Grain Process Scheduling

Context switching times are a major source of overhead in medium to fine grain process scheduling. We compared three different context switching techniques for non-preemptive scheduling in the context of hardware/software codesign, and found major differences in performance and code size efficiency.


Dataflow-Based Lenient Implementation of a Functional Language, Valid, on Conventional Multi-Processors

In this paper, we present a dataflow-based lenient implementation of a functional language, Valid, on conventional multi-processors. A dataflow execution scheme offers a good basis to execute in a highly concurrent way a large number of fine grain function instances, created during the execution of a functional program. The lenient execution and split-phase operation will overlap the idle time caused...


Memory Latency Reduction with Fine-grain Migrating Threads in Numa Shared-memory Multiprocessors

In order to fully realize the potential performance benefits of large-scale NUMA shared memory multiprocessors, efficient techniques to reduce/tolerate long memory access latencies in such systems are to be developed. This paper discusses the concept, software and hardware support for memory latency reduction through fine-grain non-transparent thread migration, referred to as mobile multithread...


Efficient Synchronization for a Large-scale Multi-core Chip Architecture

Multi-core architectures are becoming mainstream, permitting increasing on-chip parallelism through hardware support for multithreading. Synchronization, especially fine-grain synchronization, is essential to the effective utilization of the computational power of high-performance large-scale multi-core architectures. However, designing and implementing fine-grain synchronization in such architect...




Publication date: 2001
